Research Progress of Single Object Tracking Based on Transformer

doi:10.16451/j.cnki.issn1003-6059.202604004


Abstract

Single object tracking (SOT) is one of the fundamental tasks in computer vision. However, traditional correlation filters and Siamese network architectures struggle to meet the growing demands for accuracy and robustness in complex, dynamic environments. The Transformer offers significant advantages in SOT by virtue of its powerful global modeling capability. This paper therefore systematically reviews recent research advances in Transformer-based SOT. Based on their overall pipeline design, existing tracking algorithms are categorized into two primary types: two-stream two-stage algorithms and one-stream one-stage algorithms. Representative algorithms of each category are analyzed in depth to highlight their relations and characteristics, and the research status of lightweight Transformer-based tracking methods is summarized. In addition, emerging trends such as Mamba-based tracking algorithms and unified model architectures are investigated, and their potential for improving model efficiency and generalizability is discussed. The performance of different Transformer-based tracking methods is analyzed and compared on multiple mainstream datasets. Finally, several promising directions for future SOT research are outlined, including lightweight models, multimodal fusion, long-term tracking and foundation-model-driven tracking, providing a reference for the research and development of SOT.

Key words: Computer Vision; Transformer; Two-Stream Two-Stage Tracking Algorithm; One-Stream One-Stage Tracking Algorithm

Received: 22 December 2025

CLC Number: TP391.41

Fund:

National Natural Science Foundation of China (No. 62402449, 62272419), Major Program of Natural Science Foundation of Zhejiang Province (No. LD26F020003), Open Project of Anhui Provincial Key Laboratory of Multimodal Cognitive Computation of Anhui University (No. MMC202409)

Corresponding Author: ZHENG Zhonglong, Ph.D., professor. His research interests include pattern recognition, machine learning and image processing.

About Authors:
ZHANG Dawei, Ph.D., associate professor. His research interests include deep learning, computer vision and pattern recognition.
XU Dongsheng, master's student. His research interests include computer vision and object tracking.
YU Zhechen, Ph.D. candidate. His research interests include artificial intelligence and computer vision.
JIANG Kaiwei, master's student. Her research interests include deep learning and computer vision.
TIAN Weigang, master's student. His research interests include computer vision and object tracking.


Cite this article:

ZHANG Dawei, XU Dongsheng, YU Zhechen, et al. Research Progress of Single Object Tracking Based on Transformer[J]. Pattern Recognition and Artificial Intelligence, 2026, 39(4): 348-378.

URL:

http://manu46.magtech.com.cn/Jweb_prai/EN/10.16451/j.cnki.issn1003-6059.202604004 OR http://manu46.magtech.com.cn/Jweb_prai/EN/Y2026/V39/I4/348
